feat: Replace mutable buffers with immutable Arrow vectors in NativeBatchReader by andygrove · Pull Request #3382 · apache/datafusion-comet

andygrove · 2026-02-03T23:12:45Z

Rationale

I would like to remove the remaining mutable buffer use from Comet so that we can use Arrow FFI best practices.

Summary

- Add ImmutableConstantColumnReader that creates Arrow vectors directly in Java without using native Rust mutable buffers, used for partition columns and missing columns in NativeBatchReader
- Support primitive types: Boolean, Byte, Short, Integer, Long, Float, Double, String, Binary, Date, Timestamp, Decimal, Null
- CometScanRule checks partition column types at planning time and falls back to Spark if complex types (StructType, ArrayType, MapType) are used
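
The planning-time check described in the summary can be illustrated with a minimal sketch. This is not the code from the PR (the real check lives in the Scala `CometScanRule` and works on Spark's `DataType`). The names `TypeKind`, `PartitionColumn`, and `PartitionTypeCheck` are hypothetical stand-ins so the example runs without Spark on the classpath.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for Spark's DataType hierarchy (assumption, not Comet's API).
enum TypeKind {
    BOOLEAN, BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE,
    STRING, BINARY, DATE, TIMESTAMP, DECIMAL, NULL,
    STRUCT, ARRAY, MAP
}

// Hypothetical description of one partition column: its name and type kind.
final class PartitionColumn {
    final String name;
    final TypeKind kind;

    PartitionColumn(String name, TypeKind kind) {
        this.name = name;
        this.kind = kind;
    }
}

final class PartitionTypeCheck {
    // Complex types are rejected; every primitive kind listed in the summary is accepted.
    static boolean isSupported(TypeKind kind) {
        switch (kind) {
            case STRUCT:
            case ARRAY:
            case MAP:
                return false;
            default:
                return true;
        }
    }

    // Returns a fallback reason for the first unsupported column, or empty if
    // every partition column can be handled by the constant column reader.
    static Optional<String> fallbackReason(List<PartitionColumn> columns) {
        for (PartitionColumn column : columns) {
            if (!isSupported(column.kind)) {
                return Optional.of(
                    "unsupported partition column type " + column.kind
                        + " for column " + column.name);
            }
        }
        return Optional.empty();
    }
}
```

Doing the check once at planning time means an unsupported partition schema is handed back to Spark before any batch is read, instead of failing inside the reader.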

Test plan

- Existing tests pass
- Manual testing with partition columns of various primitive types

🤖 Generated with Claude Code

…mentation

The documentation incorrectly claimed that native_iceberg_compat "removes the use of reusable mutable-buffers". In reality, both native_comet and native_iceberg_compat use reusable mutable buffers when transferring data via Arrow FFI.

This commit:
- Removes the inaccurate claim
- Replaces it with accurate description of Parquet decoding delegation
- Adds a note explaining the actual mutable buffer behavior
- Links to the FFI documentation for details on arrow_ffi_safe flag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Clarified note on mutable buffers and updated details on `native_iceberg_compat` implementation.

…ader

Add ImmutableConstantColumnReader that creates Arrow vectors directly in Java without using native Rust mutable buffers. This is used for partition columns and missing columns in NativeBatchReader.

Key changes:
- New ImmutableConstantColumnReader creates Arrow vectors using Arrow Java APIs, supporting primitive types (Boolean, Byte, Short, Integer, Long, Float, Double, String, Binary, Date, Timestamp, Decimal, Null)
- NativeBatchReader now uses ImmutableConstantColumnReader instead of ConstantColumnReader for partition and missing columns
- CometScanRule checks partition column types at planning time and falls back to Spark if complex types (StructType, ArrayType, MapType) are used, since ImmutableConstantColumnReader only supports primitives

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codecov-commenter · 2026-02-03T23:25:40Z

Codecov Report

❌ Patch coverage is 2.18579% with 179 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.38%. Comparing base (f09f8af) to head (6bdfc9e).
⚠️ Report is 922 commits behind head on main.

Files with missing lines	Patch %	Lines
...e/comet/parquet/ImmutableConstantColumnReader.java	0.00%	171 Missing ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala	40.00%	4 Missing and 2 partials ⚠️
...va/org/apache/comet/parquet/NativeBatchReader.java	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3382      +/-   ##
============================================
+ Coverage     56.12%   59.38%   +3.25%     
- Complexity      976     1463     +487     
============================================
  Files           119      176      +57     
  Lines         11743    16358    +4615     
  Branches       2251     2728     +477     
============================================
+ Hits           6591     9714    +3123     
- Misses         4012     5288    +1276     
- Partials       1140     1356     +216


andygrove · 2026-02-04T19:05:16Z

Moving this to draft until we have benchmarks to ensure there is no regression with this change.

Instead of materializing full N-element Arrow arrays for partition and missing columns, export 1-element arrays from JVM and expand them on the native side using ScalarValue. This avoids O(N) memory allocation and copying for constant columns.

- Add CometConstantVector: stores a single value, lazily creates a 1-element Arrow vector for FFI export, returns constant for all rowIds
- Modify ImmutableConstantColumnReader to produce CometConstantVector
- Add CometConstantVector case in NativeUtil.exportBatch() to skip row count validation for 1-element vectors
- In scan.rs, detect 1-element arrays and expand via ScalarValue when actual_num_rows > 1; skip take() for scalar columns with selection vectors since constants are unaffected by row deletion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
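
The constant-vector idea in this commit can be sketched in plain Java. This is only an illustration under stated assumptions: `ConstantLongColumn` and its methods are hypothetical names, a `long[]` stands in for the 1-element Arrow vector, and the `expand` method mimics what the commit message says the native side does in scan.rs. The real `CometConstantVector` exports an Arrow vector over FFI and is not shown here.

```java
import java.util.Arrays;

// Hypothetical sketch of a constant column: one stored value, many logical rows.
final class ConstantLongColumn {
    private final long value;
    private final int numRows;
    private long[] oneElement; // created lazily, only when an export is requested

    ConstantLongColumn(long value, int numRows) {
        this.value = value;
        this.numRows = numRows;
    }

    // Every valid rowId maps to the same stored value, so reads cost O(1) memory.
    long getLong(int rowId) {
        if (rowId < 0 || rowId >= numRows) {
            throw new IndexOutOfBoundsException("rowId " + rowId);
        }
        return value;
    }

    // Stand-in for the FFI export: a 1-element array, built once and reused.
    long[] exportOneElement() {
        if (oneElement == null) {
            oneElement = new long[] {value};
        }
        return oneElement;
    }

    // Stand-in for the receiving side: a 1-element array is treated as a scalar
    // and expanded only when the batch actually has more than one row.
    static long[] expand(long[] exported, int actualNumRows) {
        if (exported.length == 1 && actualNumRows > 1) {
            long[] expanded = new long[actualNumRows];
            Arrays.fill(expanded, exported[0]);
            return expanded;
        }
        return exported;
    }
}
```

Because every row holds the same value, filtering rows through a selection vector cannot change the column's contents, only its length. That is why the commit can skip `take()` for these columns.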

Adds a benchmark that writes a partitioned parquet table and measures scan performance with 1 and 5 partition columns. Tests both reading data columns alongside partitions and reading partition columns themselves. This exercises the CometConstantVector → native scalar expansion path. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Adds a standalone benchmark that writes partitioned parquet tables and measures scan performance with 1 and 5 partition columns. Tests both reading data columns alongside partitions and reading partition columns themselves. This exercises the CometConstantVector path where constant columns are exported as 1-element Arrow arrays and expanded on the native side. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

andygrove · 2026-02-05T16:00:25Z

This approach is too slow. I am going to try some other approaches.

andygrove and others added 5 commits February 3, 2026 11:22

Update parquet_scans.md with mutable buffers note

6463273

Clarified note on mutable buffers and updated details on `native_iceberg_compat` implementation.

format

0cc8fbe

Revert documentation changes to parquet_scans.md

6bdfc9e

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

andygrove changed the title ~~Replace mutable buffers with immutable Arrow vectors in NativeBatchReader~~ feat: Replace mutable buffers with immutable Arrow vectors in NativeBatchReader Feb 3, 2026

andygrove marked this pull request as ready for review February 4, 2026 00:57

andygrove requested a review from parthchandra February 4, 2026 16:40

andygrove marked this pull request as draft February 4, 2026 19:05

andygrove and others added 6 commits February 5, 2026 07:55

Fix clippy warning: use iterator instead of index loop

befc755

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix partitionBy to pass column names as separate arguments

bd6eb62

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark native_datafusion and native_iceberg_compat only

b2ded2a

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

andygrove closed this Feb 5, 2026

andygrove wants to merge 11 commits into apache:main from andygrove:immutable-constant-column-reader
